{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "train = pd.read_csv('/home/tidal/ML_Data/Tweet_Sentiment_Extraction/train.csv')\n",
    "test = pd.read_csv('/home/tidal/ML_Data/Tweet_Sentiment_Extraction/test.csv')\n",
    "target=train['sentiment']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 27486 rows and 4 cols in train set\n",
      "There are 3535 rows and 3 cols in test set\n"
     ]
    }
   ],
   "source": [
    "print('There are {} rows and {} cols in train set'.format(train.shape[0],train.shape[1]))\n",
    "print('There are {} rows and {} cols in test set'.format(test.shape[0],test.shape[1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <font size='4' color='blue'> Fast BERT-lstm model</font><a id='5'></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ここでは、BERT埋め込みを用いて、ターゲットキーフレーズを予測するための多入力モデルの構築を試みている。これはナイーブなアプローチであり、後ほど改良を加えていく予定です。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import transformers\n",
    "from tokenizers import BertWordPieceTokenizer\n",
    "import gc\n",
    "import os"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "22"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cols=['textID','text','sentiment','selected_text']\n",
    "train_df=train[cols].copy()\n",
    "del train\n",
    "test_df=test.copy()\n",
    "del test\n",
    "gc.collect()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "ename": "NameError",
     "evalue": "name 'train' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-6-3536571a9bc3>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mtrain\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;31mNameError\u001b[0m: name 'train' is not defined"
     ]
    }
   ],
   "source": [
    "train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Below function is from this [kernel](https://www.kaggle.com/xhlulu/jigsaw-tpu-distilbert-with-huggingface-and-keras) by @xhlulu,this is used to encode the sentences easily and quickly using distilbert tokenizer.\n",
    "- 以下の関数は @xhlulu さんのカーネルのもので、 distilbert tokenizer を使って簡単かつ高速に文章をエンコードしています。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fast_encode(texts, tokenizer, chunk_size=256, maxlen=128):\n",
    "    tokenizer.enable_truncation(max_length=maxlen)\n",
    "    tokenizer.enable_padding(max_length=maxlen)\n",
    "    all_ids = []\n",
    "    \n",
    "    for i in tqdm(range(0, len(texts), chunk_size)):\n",
    "        text_chunk = texts[i:i+chunk_size].tolist()\n",
    "        encs = tokenizer.encode_batch(text_chunk)\n",
    "        all_ids.extend([enc.ids for enc in encs])\n",
    "    \n",
    "    return np.array(all_ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- We load the Distilbert pretained tokenizer (uncased) and save it to directory.\n",
    "- Reload and use BertWordPieceTokenizer.\n",
    "- An implementation of a tokenizer consists of the following pipeline of processes, each applying different transformations to the textual information:\n",
    "- Distilbert pretained tokenizerをロードし、ディレクトリに保存します。\n",
    "- リロードして、BertWordPieceTokenizerを使用します。\n",
    "- トークン化器の実装は、それぞれがテキスト情報に異なる変換を適用する以下のプロセスのパイプラインで構成されています。\n",
    "![](https://miro.medium.com/max/1400/1*7uy9X3eE1rVmqV08yKrDgg.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Tokenizer(vocabulary_size=30522, model=BertWordPiece, add_special_tokens=True, unk_token=[UNK], sep_token=[SEP], cls_token=[CLS], clean_text=True, handle_chinese_chars=True, strip_accents=True, lowercase=True, wordpieces_prefix=##)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-uncased')  ## change it to commit\n",
    "\n",
    "# Save the loaded tokenizer locally\n",
    "save_path = '/home/tidal/ML_Data/Tweet_Sentiment_Extraction/distilbert_base_uncased/'\n",
    "if not os.path.exists(save_path):\n",
    "    os.makedirs(save_path)\n",
    "tokenizer.save_pretrained(save_path)\n",
    "\n",
    "# Reload it with the huggingface tokenizers library\n",
    "fast_tokenizer = BertWordPieceTokenizer('/home/tidal/ML_Data/Tweet_Sentiment_Extraction/distilbert_base_uncased/vocab.txt', lowercase=True)\n",
    "fast_tokenizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Now the comment text is prepared and encoded using this tokenizer easily.\n",
    "- We here set the maxlen=128,(limit)\n",
    "- これで、このトークナイザーを使って簡単にコメントテキストを作成し、エンコードすることができるようになりました。\n",
    "- ここでは、maxlen=128, (limit)を設定しています。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "622e595c139e45caad98c9b4f03fcb37",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=108.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "dd7e89583b4c4ea3a4b6d6b9cb2fae82",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=14.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from tqdm.notebook import tqdm\n",
    "x_train = fast_encode(train_df.text.astype(str), fast_tokenizer, maxlen=128)\n",
    "x_test = fast_encode(test_df.text.astype(str),fast_tokenizer,maxlen=128)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[  101,  2985,  1996, ...,     0,     0,     0],\n",
       "       [  101,  2821,   999, ...,     0,     0,     0],\n",
       "       [  101,  2758,  2204, ...,     0,     0,     0],\n",
       "       ...,\n",
       "       [  101,  2652, 19219, ...,     0,     0,     0],\n",
       "       [  101,  2156,  1057, ...,     0,     0,     0],\n",
       "       [  101,  5292,  5292, ...,     0,     0,     0]])"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x_train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Now we load the pretrained bert ('uncased') transformer layer.\n",
    "- This is used for creating the representations and training our corpus.\n",
    "- ここで、事前学習されたbert ('uncased') トランスフォーマー層をロードします。\n",
    "- これは表現の作成とコーパスの学習に使われます。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "transformer_layer = transformers.TFDistilBertModel.from_pretrained('distilbert-base-uncased')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This code is lifted from [kernel](https://www.kaggle.com/gskdhiman/bert-baseline-starter-kernel#Training).\n",
    "- In this section we create the representaion for the selected text from tweet text.\n",
    "- The representation is created such that the positions of tokens which is selcted from text is represented with 1 and others with 0.\n",
    "- for example,consider the tweet `\" I have a cute dog\"` and selected text `\"cute dog\"`\n",
    "- This produces the ouput as ` [0,0,0,1,1]`\n",
    "- ここでは、ツイートのテキストから選択されたテキストの表現を作成します。\n",
    "- 表現は、テキストから選択されたトークンの位置が1、それ以外の位置が0となるように作成されます。\n",
    "- 例えば、`\"I have a cute dog\"`というツイートと、`\"cute dog\"`という選択テキストを考えてみましょう。\n",
    "- これにより、出力は ` [0,0,0,1,1]` となります\n",
    "\n",
    "__t_textが一部おかしい(”##”が入っている)__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MAX_SEQ_LENGTH_TEXT 108\n",
      "MAX_TARGET_LENGTH 108\n"
     ]
    }
   ],
   "source": [
    "def create_targets(df):\n",
    "    df['t_text'] = df['text'].apply(lambda x: tokenizer.tokenize(str(x)))\n",
    "    df['t_selected_text'] = df['selected_text'].apply(lambda x: tokenizer.tokenize(str(x)))\n",
    "    def func(row):\n",
    "        x,y = row['t_text'],row['t_selected_text'][:]\n",
    "        for offset in range(len(x)):\n",
    "            d = dict(zip(x[offset:],y))\n",
    "            #when k = v that means we found the offset\n",
    "            check = [k==v for k,v in d.items()]\n",
    "            if all(check)== True:\n",
    "                break \n",
    "        return [0]*offset + [1]*len(y) + [0]* (len(x)-offset-len(y))\n",
    "    df['targets'] = df.apply(func,axis=1)\n",
    "    return df\n",
    "\n",
    "train_df = create_targets(train_df)\n",
    "\n",
    "print('MAX_SEQ_LENGTH_TEXT', max(train_df['t_text'].apply(len)))\n",
    "print('MAX_TARGET_LENGTH',max(train_df['targets'].apply(len)))\n",
    "MAX_TARGET_LEN=108"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>textID</th>\n",
       "      <th>text</th>\n",
       "      <th>sentiment</th>\n",
       "      <th>selected_text</th>\n",
       "      <th>t_text</th>\n",
       "      <th>t_selected_text</th>\n",
       "      <th>targets</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a3d0a7d5ad</td>\n",
       "      <td>Spent the entire morning in a meeting w/ a ven...</td>\n",
       "      <td>neutral</td>\n",
       "      <td>my boss was not happy w/ them. Lots of fun.</td>\n",
       "      <td>[spent, the, entire, morning, in, a, meeting, ...</td>\n",
       "      <td>[my, boss, was, not, happy, w, /, them, ., lot...</td>\n",
       "      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>251b6a6766</td>\n",
       "      <td>Oh! Good idea about putting them on ice cream</td>\n",
       "      <td>positive</td>\n",
       "      <td>Good</td>\n",
       "      <td>[oh, !, good, idea, about, putting, them, on, ...</td>\n",
       "      <td>[good]</td>\n",
       "      <td>[0, 0, 1, 0, 0, 0, 0, 0, 0, 0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>c9e8d1ef1c</td>\n",
       "      <td>says good (or should i say bad?) afternoon!  h...</td>\n",
       "      <td>neutral</td>\n",
       "      <td>says good (or should i say bad?) afternoon!</td>\n",
       "      <td>[says, good, (, or, should, i, say, bad, ?, ),...</td>\n",
       "      <td>[says, good, (, or, should, i, say, bad, ?, ),...</td>\n",
       "      <td>[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>f14f087215</td>\n",
       "      <td>i dont think you can vote anymore! i tried</td>\n",
       "      <td>negative</td>\n",
       "      <td>i dont think you can vote anymore!</td>\n",
       "      <td>[i, don, ##t, think, you, can, vote, anymore, ...</td>\n",
       "      <td>[i, don, ##t, think, you, can, vote, anymore, !]</td>\n",
       "      <td>[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>bf7473b12d</td>\n",
       "      <td>haha better drunken tweeting you mean?</td>\n",
       "      <td>positive</td>\n",
       "      <td>better</td>\n",
       "      <td>[ha, ##ha, better, drunken, t, ##wee, ##ting, ...</td>\n",
       "      <td>[better]</td>\n",
       "      <td>[0, 0, 1, 0, 0, 0, 0, 0, 0, 0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27481</th>\n",
       "      <td>3dbae74fcd</td>\n",
       "      <td>I want to go to VP, but no one is willing to c...</td>\n",
       "      <td>neutral</td>\n",
       "      <td>I want to go to VP, but no one is willing to c...</td>\n",
       "      <td>[i, want, to, go, to, vp, ,, but, no, one, is,...</td>\n",
       "      <td>[i, want, to, go, to, vp, ,, but, no, one, is,...</td>\n",
       "      <td>[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27482</th>\n",
       "      <td>63147b35cb</td>\n",
       "      <td>Wah, why are you sad?</td>\n",
       "      <td>neutral</td>\n",
       "      <td>Wah, why are you sad?</td>\n",
       "      <td>[wah, ,, why, are, you, sad, ?]</td>\n",
       "      <td>[wah, ,, why, are, you, sad, ?]</td>\n",
       "      <td>[1, 1, 1, 1, 1, 1, 1]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27483</th>\n",
       "      <td>bdb196a09f</td>\n",
       "      <td>playing sudoku while mommy makes me breakfast ...</td>\n",
       "      <td>neutral</td>\n",
       "      <td>playing sudoku while mommy makes me breakfast ...</td>\n",
       "      <td>[playing, sud, ##oku, while, mommy, makes, me,...</td>\n",
       "      <td>[playing, sud, ##oku, while, mommy, makes, me,...</td>\n",
       "      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27484</th>\n",
       "      <td>18c2a1e98e</td>\n",
       "      <td>see u bye see u!  i love the hot30</td>\n",
       "      <td>positive</td>\n",
       "      <td>i love</td>\n",
       "      <td>[see, u, bye, see, u, !, i, love, the, hot, ##30]</td>\n",
       "      <td>[i, love]</td>\n",
       "      <td>[0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27485</th>\n",
       "      <td>1c1f3724db</td>\n",
       "      <td>ha ha, and what game is that? i like games</td>\n",
       "      <td>positive</td>\n",
       "      <td>? i like</td>\n",
       "      <td>[ha, ha, ,, and, what, game, is, that, ?, i, l...</td>\n",
       "      <td>[?, i, like]</td>\n",
       "      <td>[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>27486 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "           textID                                               text  \\\n",
       "0      a3d0a7d5ad  Spent the entire morning in a meeting w/ a ven...   \n",
       "1      251b6a6766      Oh! Good idea about putting them on ice cream   \n",
       "2      c9e8d1ef1c  says good (or should i say bad?) afternoon!  h...   \n",
       "3      f14f087215         i dont think you can vote anymore! i tried   \n",
       "4      bf7473b12d             haha better drunken tweeting you mean?   \n",
       "...           ...                                                ...   \n",
       "27481  3dbae74fcd  I want to go to VP, but no one is willing to c...   \n",
       "27482  63147b35cb                              Wah, why are you sad?   \n",
       "27483  bdb196a09f  playing sudoku while mommy makes me breakfast ...   \n",
       "27484  18c2a1e98e                 see u bye see u!  i love the hot30   \n",
       "27485  1c1f3724db         ha ha, and what game is that? i like games   \n",
       "\n",
       "      sentiment                                      selected_text  \\\n",
       "0       neutral        my boss was not happy w/ them. Lots of fun.   \n",
       "1      positive                                               Good   \n",
       "2       neutral        says good (or should i say bad?) afternoon!   \n",
       "3      negative                 i dont think you can vote anymore!   \n",
       "4      positive                                             better   \n",
       "...         ...                                                ...   \n",
       "27481   neutral  I want to go to VP, but no one is willing to c...   \n",
       "27482   neutral                              Wah, why are you sad?   \n",
       "27483   neutral  playing sudoku while mommy makes me breakfast ...   \n",
       "27484  positive                                             i love   \n",
       "27485  positive                                           ? i like   \n",
       "\n",
       "                                                  t_text  \\\n",
       "0      [spent, the, entire, morning, in, a, meeting, ...   \n",
       "1      [oh, !, good, idea, about, putting, them, on, ...   \n",
       "2      [says, good, (, or, should, i, say, bad, ?, ),...   \n",
       "3      [i, don, ##t, think, you, can, vote, anymore, ...   \n",
       "4      [ha, ##ha, better, drunken, t, ##wee, ##ting, ...   \n",
       "...                                                  ...   \n",
       "27481  [i, want, to, go, to, vp, ,, but, no, one, is,...   \n",
       "27482                    [wah, ,, why, are, you, sad, ?]   \n",
       "27483  [playing, sud, ##oku, while, mommy, makes, me,...   \n",
       "27484  [see, u, bye, see, u, !, i, love, the, hot, ##30]   \n",
       "27485  [ha, ha, ,, and, what, game, is, that, ?, i, l...   \n",
       "\n",
       "                                         t_selected_text  \\\n",
       "0      [my, boss, was, not, happy, w, /, them, ., lot...   \n",
       "1                                                 [good]   \n",
       "2      [says, good, (, or, should, i, say, bad, ?, ),...   \n",
       "3       [i, don, ##t, think, you, can, vote, anymore, !]   \n",
       "4                                               [better]   \n",
       "...                                                  ...   \n",
       "27481  [i, want, to, go, to, vp, ,, but, no, one, is,...   \n",
       "27482                    [wah, ,, why, are, you, sad, ?]   \n",
       "27483  [playing, sud, ##oku, while, mommy, makes, me,...   \n",
       "27484                                          [i, love]   \n",
       "27485                                       [?, i, like]   \n",
       "\n",
       "                                                 targets  \n",
       "0      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...  \n",
       "1                         [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]  \n",
       "2      [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ...  \n",
       "3                      [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  \n",
       "4                         [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]  \n",
       "...                                                  ...  \n",
       "27481  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...  \n",
       "27482                              [1, 1, 1, 1, 1, 1, 1]  \n",
       "27483  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, ...  \n",
       "27484                  [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]  \n",
       "27485               [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]  \n",
       "\n",
       "[27486 rows x 7 columns]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Now we need to make each output of the same length to feed it to the neural network.\n",
    "- For that we find the maxlength of the target and pad all other target to this length.\n",
    "- ここで、ニューラルネットワークに供給するために、各出力を同じ長さにする必要があります。\n",
    "- そのためには、ターゲットの最大長を見つけ、他の全てのターゲットをこの長さにパッドする。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df['targets'] = train_df['targets'].apply(lambda x :x + [0] * (MAX_TARGET_LEN-len(x)))\n",
    "targets=np.asarray(train_df['targets'].values.tolist())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n"
     ]
    }
   ],
   "source": [
    "print(train_df['targets'][27485])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- We need to use the sentiment as a feature,for this encode it using LabelEncode.\n",
    "- センチメントを特徴量として使用する必要がありますが、そのためにはLabelEncodeを使用してエンコードします。\n",
    "\n",
    "__labelの出し方は再考の余地あり(sentimentは良し悪しを表すので何か表現できそう)__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import LabelEncoder\n",
    "lb=LabelEncoder()\n",
    "sent_train=lb.fit_transform(train_df['sentiment'])\n",
    "sent_test=lb.fit_transform(test_df['sentiment'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 2, 1, ..., 1, 2, 2])"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sent_train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- This is a multi-input model (comment+sentiment label).\n",
    "- I have made a simple LSTM model\n",
    "- concatenated both the inputs \n",
    "- マルチ入力モデル（コメント＋センティメントラベル）です。\n",
    "- 簡単なLSTMモデルを作ってみました\n",
    "- 両方の入力を連結した "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-0.01664949, -0.06661227, -0.01632868, ..., -0.01999032,\n",
       "        -0.05139988, -0.0263568 ],\n",
       "       [-0.01319846, -0.06733431, -0.01605646, ..., -0.0226614 ,\n",
       "        -0.05537301, -0.02600443],\n",
       "       [-0.01759106, -0.07094341, -0.01443494, ..., -0.02457913,\n",
       "        -0.05956192, -0.0231829 ],\n",
       "       ...,\n",
       "       [-0.0231029 , -0.05878259, -0.01048967, ..., -0.01945743,\n",
       "        -0.02615411, -0.02118432],\n",
       "       [-0.0490171 , -0.05614787, -0.00465348, ..., -0.01065376,\n",
       "        -0.01797333, -0.02187675],\n",
       "       [-0.00646111, -0.0914881 , -0.00254872, ..., -0.01505679,\n",
       "        -0.05040044,  0.04597744]], dtype=float32)"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embedding_matrix=transformer_layer.weights[0].numpy()\n",
    "embedding_matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.save('/home/tidal/ML_Data/Tweet_Sentiment_Extraction/features/x_train', x_train)\n",
    "np.save('/home/tidal/ML_Data/Tweet_Sentiment_Extraction/features/x_test', x_test)\n",
    "np.save('/home/tidal/ML_Data/Tweet_Sentiment_Extraction/features/sent_train', sent_train)\n",
    "np.save('/home/tidal/ML_Data/Tweet_Sentiment_Extraction/features/sent_test', sent_test)\n",
    "np.save('/home/tidal/ML_Data/Tweet_Sentiment_Extraction/targets/targets', targets)\n",
    "np.save('/home/tidal/ML_Data/Tweet_Sentiment_Extraction/transformer_layer/embedding_matrix', embedding_matrix)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
